Loan Data Exploration by Xue Bai

The data set contains 113,937 loans with 81 variables on each loan. I chose 12 variables from this data set to explore: loan origination date, term, loan status, borrower rate, listing category, occupation, employment status, monthly loan payment, loan original amount, income range, credit grade and borrower state.

Univariate Analysis

##   EmploymentStatus Term ListingCategory..numeric.    Occupation
## 1    Self-employed   36                         0         Other
## 2         Employed   36                         2  Professional
## 3    Not available   36                         0         Other
## 4         Employed   36                        16 Skilled Labor
## 5         Employed   36                         2     Executive
## 6         Employed   60                         1  Professional
##   MonthlyLoanPayment LoanOriginalAmount    IncomeRange CreditGrade
## 1             330.43               9425 $25,000-49,999           C
## 2             318.93              10000 $50,000-74,999            
## 3             123.32               3001  Not displayed          HR
## 4             321.45              10000 $25,000-49,999            
## 5             563.97              15000      $100,000+            
## 6             342.37              15000      $100,000+            
##   BorrowerState LoanStatus BorrowerRate LoanOriginationDate year
## 1            CO  Completed       0.1580 2007-09-12 00:00:00 2007
## 2            CO    Current       0.0920 2014-03-03 00:00:00 2014
## 3            GA  Completed       0.2750 2007-01-17 00:00:00 2007
## 4            GA    Current       0.0974 2012-11-01 00:00:00 2012
## 5            MN    Current       0.2085 2013-09-20 00:00:00 2013
## 6            NM    Current       0.1314 2013-12-24 00:00:00 2013

Univariate Plots Section

Number of loans in 2005-2014

From this plot we can see that the number of loans increases from 2005 to 2008, but drops sharply in 2009. After 2009, the number of loans increases quite fast and peaks in 2013. The dataset only contains loans from before 3/11/2014, so it is not surprising that the number of loans in 2014 is quite small.

Term

From this plot we can see that most people choose a 36-month loan.

The reason why people borrow

Listing Category

From this plot we can see that most people borrow for debt consolidation.

Loan Status

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

Most of the loans are completed or current. Few loans are past due.
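The claim above can be checked directly from the table. The following is a minimal Python sketch (not the R code used for this report) that takes the counts exactly as printed and collapses the six "Past Due" buckets into one group; the variable names are my own.

```python
# Loan status counts copied from the table above.
status_counts = {
    "Cancelled": 5,
    "Chargedoff": 11992,
    "Completed": 38074,
    "Current": 56576,
    "Defaulted": 5018,
    "FinalPaymentInProgress": 205,
    "Past Due (>120 days)": 16,
    "Past Due (1-15 days)": 806,
    "Past Due (16-30 days)": 265,
    "Past Due (31-60 days)": 363,
    "Past Due (61-90 days)": 313,
    "Past Due (91-120 days)": 304,
}

total = sum(status_counts.values())

# Collapse the six "Past Due" buckets into a single group.
past_due = sum(
    count for status, count in status_counts.items()
    if status.startswith("Past Due")
)
completed_or_current = status_counts["Completed"] + status_counts["Current"]

print(f"total loans:          {total}")
print(f"completed or current: {completed_or_current / total:.1%}")
print(f"past due:             {past_due / total:.1%}")
```

The total comes to 113,937, which matches the size of the data set. Completed and current loans make up about 83% of it, and past-due loans about 2%.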

BorrowerState

From this plot we can see that the number of loans in CA is by far the largest. The number of loans in FL, IL, NY and TX is also very large.

Occupation

Employment

Incomerange

From this plot we can see that most borrowers have an income between $25,000 and $74,999.

Loan Amount

Most loans have an original amount below 27,000.

Credit Grade

Most of the credit grades are left blank, so we need to ignore the blank credit grades when making the bivariate and multivariate plots.

Borrower Rate

Most borrower rates are between 0.05 and 0.35, which is quite high. The distribution of the rate looks roughly normal.

What is the structure of your dataset?

The data set contains 113,937 loans with 81 variables on each loan. I chose 12 variables from this data set to explore: loan origination date, term, loan status, borrower rate, listing category, occupation, employment status, monthly loan payment, loan original amount, income range, credit grade and borrower state.

term: The length of the loan expressed in months (12, 36, 60).

loan status: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, Past Due.

listing category: The category of the listing that the borrower selected when posting their listing (including: Not Available, Debt Consolidation, Home Improvement, Business, Personal Loan, Student Use, Auto, Other, Baby&Adoption, Boat, Cosmetic Procedure, Engagement Ring, Green Loans, Household Expenses, Large Purchases, Medical/Dental, Motorcycle, RV, Taxes, Vacation, Wedding Loans).

What is/are the main feature(s) of interest in your dataset?

I am interested in which factors influence the loan rate, and which factor is the most important. The factors I am interested in include: 1. when (year of the loan) 2. where (borrower state) 3. term of loan 4. occupation/employment status of borrower 5. loan amount 6. income range of borrower 7. credit grade of borrower

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I also explore the reasons why people borrow.

Did you create any new variables from existing variables in the dataset?

I created year to indicate the year when the loan started.
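As an illustration of this step, here is a minimal Python sketch (not the R code used for this report) that derives the year from the LoanOriginationDate strings of the six sample rows shown at the top. The helper name `origination_year` is hypothetical, and the sketch assumes every timestamp follows the `YYYY-MM-DD HH:MM:SS` layout seen in those rows.

```python
from datetime import datetime

# LoanOriginationDate values from the six sample rows shown earlier.
origination_dates = [
    "2007-09-12 00:00:00",
    "2014-03-03 00:00:00",
    "2007-01-17 00:00:00",
    "2012-11-01 00:00:00",
    "2013-09-20 00:00:00",
    "2013-12-24 00:00:00",
]


def origination_year(timestamp: str) -> int:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string and return its year."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").year


years = [origination_year(d) for d in origination_dates]
print(years)
```

The resulting values agree with the year column in the sample rows.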

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I changed the listing category of each loan from a number to words to make it easier for readers to understand.
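The recoding can be sketched as a simple lookup. This is a minimal Python illustration (not the R code used for this report). It assumes that the numeric codes 0 to 20 follow the order in which the categories are listed in the structure section; the names `LISTING_CATEGORIES` and `category_label` are my own.

```python
# Category labels in listing order; ASSUMPTION: code i maps to the i-th label.
LISTING_CATEGORIES = [
    "Not Available", "Debt Consolidation", "Home Improvement", "Business",
    "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
    "Cosmetic Procedure", "Engagement Ring", "Green Loans",
    "Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle",
    "RV", "Taxes", "Vacation", "Wedding Loans",
]


def category_label(code: int) -> str:
    """Return the label for a numeric listing category, or 'Unknown'."""
    if 0 <= code < len(LISTING_CATEGORIES):
        return LISTING_CATEGORIES[code]
    return "Unknown"


# ListingCategory..numeric. values from the six sample rows shown earlier.
sample_codes = [0, 2, 0, 16, 2, 1]
labels = [category_label(code) for code in sample_codes]
print(labels)
```

Returning "Unknown" for out-of-range codes is a design choice of the sketch, so that an unexpected value shows up in a plot instead of raising an error.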

Bivariate Plots Section

From the plot below we can see that the average borrower rate differs from year to year, and there are only a few outliers. The average borrower rate is quite small in 2005. From 2006 to 2008, the average borrower rate decreases a little. From 2008 to 2011, the borrower rate increases. And from 2011 to 2014, the average borrower rate decreases.

year-rate
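The per-year average behind this kind of plot is a group-by on year followed by a mean. Here is a minimal Python sketch (not the R code used for this report) using only the six sample rows shown at the top, so the numbers it prints are illustrative and are not the averages in the plot.

```python
from collections import defaultdict

# (year, BorrowerRate) pairs from the six sample rows shown earlier.
sample = [
    (2007, 0.1580),
    (2014, 0.0920),
    (2007, 0.2750),
    (2012, 0.0974),
    (2013, 0.2085),
    (2013, 0.1314),
]

# Collect the rates that belong to each year.
rates_by_year = defaultdict(list)
for year, rate in sample:
    rates_by_year[year].append(rate)

# Average each group, keeping the years in ascending order.
mean_rate_by_year = {
    year: sum(rates) / len(rates)
    for year, rates in sorted(rates_by_year.items())
}

for year, mean_rate in mean_rate_by_year.items():
    print(year, round(mean_rate, 4))
```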

From this plot we can see that the longer the term, the higher the rate is likely to be.

Term-rate

The loan rate does not differ much across employment statuses. However, the rate for borrowers who are not employed is higher than for the other statuses.

Employment-rate

From the scatter plot I find that the variance of the rate decreases as the loan original amount increases. Below 25,000 dollars, the borrower rate seems to have little relation to the loan original amount. For loans over 25,000 dollars, the variance of the rate is clearly smaller.

The blue line is the fitted trend line of borrower rate against loan original amount. From this line we can see that the larger the loan original amount, the lower the borrower rate tends to be.

LoanAmount-rate
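How the blue line was fitted is not stated here, so the following is only a sketch under the assumption that it is an ordinary least-squares line. It is written in Python (not the R code used for this report) and uses only the six sample rows shown at the top, so the slope it prints is illustrative. The helper name `fit_line` is my own.

```python
# LoanOriginalAmount and BorrowerRate from the six sample rows shown earlier.
amounts = [9425, 10000, 3001, 10000, 15000, 15000]
rates = [0.1580, 0.0920, 0.2750, 0.0974, 0.2085, 0.1314]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(amounts, rates)
print(f"slope per dollar: {slope:.3e}")
print(f"intercept:        {intercept:.4f}")
```

A negative slope means that the fitted rate falls as the loan original amount grows, which is the direction described above.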

The borrower rate has little relation to the borrower state.

borrowerstate-rate

The borrower rate is lower when the income range of the borrower is higher. But surprisingly, the rate for the $0 income range is the lowest. There may be some special loan program for these borrowers.

IncomeRange-rate

Apparently, the higher the credit grade, the lower the rate is.

Credit Grade

The relation between the category of a loan and its loan original amount

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Some factors that I thought might affect the loan rate actually have little impact on it. The factors that I find have a relatively strong impact on the loan rate are year of loan, term, income range and credit grade.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The loan original amount, the borrower state and the employment status have little relation to the loan rate.

What was the strongest relationship you found?

The year of the loan is the most important factor for the borrower rate.

Multivariate Plots Section

From the plot below we can see that before 2009, the term of a loan is always 36 months. After 2009, there are loans for 12, 36 and 60 months. When the 12- and 60-month loans first appear (in 2010), their average rate is quite low compared with the rate for 36-month loans, but it increases quite fast.

time-rate-term

After 2009, no credit grade is available. But for 2005-2009 we can tell that, in general, the higher the credit grade, the lower the average borrower rate.

time-rate-credit

The plot below describes the average borrower rate in 2005-2014 for people in different income ranges. I omit the ‘Not displayed’ category of the income range. From the plot below we can see that in general, the higher the income range, the lower the average borrower rate. And the basic shape of the average borrower rate over the years is the same for every income range.

time-rate-Incomerange

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The trend of the average loan rate over time is nearly the same across different terms, income ranges and credit grades.

Were there any interesting or surprising interactions between features?

In 2012-2014, the average loan rate decreases, but the average rate for unemployed people increases a lot.


Final Plots and Summary

Plot One

##     0     1     2     3     4     5     6     7     8     9    10    11 
## 16965 58308  7433  7189  2395   756  2572 10494   199    85    91   217 
##    12    13    14    15    16    17    18    19    20 
##    59  1996   876  1522   304    52   885   768   771

Description One

This plot describes the number of loans in each category. From this plot we can see that most people borrow for debt consolidation.
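The table printed under Plot One is enough to quantify this. Below is a minimal Python sketch (not the R code used for this report) that ranks the numeric category codes by count and computes the share of the largest one. It assumes that code 1 corresponds to Debt Consolidation, following the order of the category list given in the structure section.

```python
# Counts per numeric listing category (codes 0..20), copied from the table
# printed under Plot One.
category_counts = [
    16965, 58308, 7433, 7189, 2395, 756, 2572, 10494, 199, 85, 91,
    217, 59, 1996, 876, 1522, 304, 52, 885, 768, 771,
]

total = sum(category_counts)

# Rank (code, count) pairs from most to least frequent.
ranked = sorted(enumerate(category_counts), key=lambda pair: pair[1], reverse=True)
top_code, top_count = ranked[0]

print(f"total loans: {total}")
print(f"largest category code: {top_code} ({top_count / total:.1%} of loans)")
print("top three codes:", [code for code, _ in ranked[:3]])
```

Under that assumption, code 1 accounts for a little over half of all loans. The next largest groups are code 0 (Not Available) and code 7 (Other).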

Plot Two

Description Two

This plot shows how the loan rate changes over 2005-2014. The red line is the average loan rate for each year. We can see from the plot that the loan rate increases from 2005 to 2011 and decreases from 2011 to 2014.

Plot Three

Description Three

This plot describes the average borrower rate in 2005-2014 for people in different income ranges. I remove the ‘Not displayed’ category of the income range because it doesn’t contain useful information. From the plot we can see that in general, the higher the income range, the lower the average borrower rate. And the basic shape of the average borrower rate over the years is the same for every income range.


Reflection

The data set contains 113,937 loans with 81 variables on each loan. I chose 12 variables from this data set to explore their relation to the loan rate: loan origination date, term, loan status, borrower rate, listing category, occupation, employment status, monthly loan payment, loan original amount, income range, credit grade and borrower state.

I started by understanding the individual variables in the data set, and then I explored the relation of each factor to the loan rate. Then I identified the factors that appear to be relatively strongly related to the loan rate and explored these factors further.

From this exploration, I find that the factors most strongly related to the loan rate are date of loan, credit grade, income range of borrower and term. What surprises me is that the loan amount has little relation to the loan rate.

When doing this project, I found it hard because much of the data is blank. For example, the credit grade is only available in 2005-2009 and many employment statuses are not displayed. So I had to omit this data when analyzing.

I also find that the loan rate is affected by so many features that I can’t describe them all in one plot. So I need to split the features, pick out the most important ones or the ones I am interested in, and then display them in different plots.

From this project, we know that the time, credit grade, income range and term have relatively strong relationships with the loan rate. In further work, maybe I can give each feature a coefficient, which would let us predict the loan rate.